{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cefa9cf6-e043-4b75-b416-a0b26c8cb3ad",
   "metadata": {},
   "source": [
    "**** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n",
    "```\n",
    "    make venv \n",
    "    source venv/bin/activate \n",
    "    pip install jupyterlab\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a84e965-feeb-424d-9263-9f127e53a1aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "## This is here as a reference only\n",
    "# Users and application developers must use the right tag for the latest from pypi\n",
    "%pip install data-prep-toolkit\n",
    "%pip install data-prep-toolkit-transforms[hap]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d695832-16bc-48d3-a9c3-6ce650ae4a5c",
   "metadata": {},
   "source": [
    "**** Configure the transform parameters. The set of dictionary keys holding DocQualityTransform configuration for values are as follows:\n",
    " - model_name_or_path - specify the HAP model, which should be compatible with HuggingFace's AutoModelForSequenceClassification. Defaults to IBM's open-source toxicity classifier ibm-granite/granite-guardian-hap-38m.\n",
    " - annotation_column - the column name containing hap (toxicity) score in the output .parquet file. Defaults to hap_score.\n",
    " - doc_text_column - the column name containing the document text in the input .parquet file. Defaults to contents.\n",
    " - batch_size - modify it based on the infrastructure capacity. Defaults to 128.\n",
    " - max_length - the maximum length for the tokenizer. Defaults to 512."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f9dbf94-2db4-492d-bbcb-53ac3948c256",
   "metadata": {},
   "source": [
    "***** Import required classes and modules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17306684-306b-48e8-a89a-4d0228e01291",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package punkt_tab to /Users/touma/nltk_data...\n",
      "[nltk_data]   Package punkt_tab is already up-to-date!\n"
     ]
    }
   ],
   "source": [
    "from dpk_hap.transform_python import HAP"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f443108f-40e4-40e5-a052-e8a7f4fbccdf",
   "metadata": {},
   "source": [
    "***** Setup runtime parameters for this transform and invoke transform"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a8ec5e4-1f52-4c61-9c9e-4618f9034b80",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "07:12:05 INFO - hap params are {'model_name_or_path': 'ibm-granite/granite-guardian-hap-38m', 'annotation_column': 'hap_score', 'doc_text_column': 'contents', 'inference_engine': 'CPU', 'max_length': 512, 'batch_size': 128} \n",
      "07:12:05 INFO - pipeline id pipeline_id\n",
      "07:12:05 INFO - code location None\n",
      "07:12:05 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output\n",
      "07:12:05 INFO - data factory data_ max_files -1, n_sample -1\n",
      "07:12:05 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
      "07:12:05 INFO - orchestrator hap started at 2024-12-11 07:12:05\n",
      "07:12:05 INFO - Number of files is 1, source profile {'max_file_size': 0.10423946380615234, 'min_file_size': 0.10423946380615234, 'total_file_size': 0.10423946380615234}\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Processing batch: 0/11\n",
      "Processing batch: 1/11\n",
      "Processing batch: 2/11\n",
      "Processing batch: 3/11\n",
      "Processing batch: 4/11\n",
      "Processing batch: 5/11\n",
      "Processing batch: 6/11\n",
      "Processing batch: 7/11\n",
      "Processing batch: 8/11\n",
      "Processing batch: 9/11\n",
      "Processing batch: 10/11\n",
      "Processing batch: 11/11\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "07:12:38 INFO - Completed 1 files (100.0%) in 0.458 min\n",
      "07:12:38 INFO - Done processing 1 files, waiting for flush() completion.\n",
      "07:12:38 INFO - done flushing in 0.0 sec\n",
      "07:12:38 INFO - Completed execution in 0.543 min, execution result 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    doc_id                                           contents  hap_score\n",
      "0        1  GSC is very much a little Swiss Army knife for...   0.002463\n",
      "1        2  When you’ve got a particular somebody that you...   0.075292\n",
      "2        3  Many believe a healthy diet is all that’s need...   0.005342\n",
      "3        4  Clinton’s plan specifically targets drugs that...   0.000294\n",
      "4        5  An aspiring actress was found hanged in a ward...   0.071551\n",
      "5        6  By Evan Ackerman\\nInside this rather large box...   0.000441\n",
      "6        7  I'm really bad at naming things. Like hopeless...   0.131478\n",
      "7        8  Metrolink is successful because of the continu...   0.000337\n",
      "8        9  Notre Dame political scientist Jeff Harden is ...   0.006692\n",
      "9       10  Federal employees who qualify for the federal ...   0.014998\n",
      "10      11  Girona will face Deportivo Alaves at Estadio M...   0.005883\n",
      "11      12  Xana & Melody's Foot Worship Punishment\\nA sla...   0.019838\n",
      "12      13  \"But the liberal deviseth liberal things; and ...   0.001782\n",
      "13      14  Workers at Linden Hills Co-op won their electi...   0.002168\n",
      "14      15  The principal features in PCOD are no ovulatio...   0.000567\n",
      "15      16  Kite Kali Beh Ke gurpreet dhanoa 11 years ago....   0.019736\n",
      "16      17  Chrsyo X1 Bag\\nLittle background information, ...   0.063229\n",
      "17      18  The season of spooks has arrived in Sea of Thi...   0.110339\n",
      "18      19  Box Office Mojo reports that Casino Royale has...   0.000420\n",
      "19      20  Here are only a few examples. And no, I'm not ...   0.989713\n",
      "20      21  Free Fiction Monday: Cosmic Balances Inc.\\nGri...   0.092669\n",
      "21      22  Motorcycle crash kills 61-year-old Hellam man ...   0.001743\n",
      "22      23  You are invited to a family-friendly geocachin...   0.000516\n",
      "23      24  Evaluation Forms are commonly used by employer...   0.001409\n",
      "24      25  4. Pray for the Nation (the Multitudes):\\nWe r...   0.356732\n",
      "25      26  Welcome to the PC Matic Process Library. We ma...   0.000680\n",
      "26      27  By Manpreet Singh, 2009 MBA and President of S...   0.000980\n",
      "27      28  Lower Dauphin School District\\nCourse Title: C...   0.000566\n",
      "28      29  Learn this in-depth pet air journey informatio...   0.001524\n",
      "29      30  Marc Jones, Chief Technology Officer, Alkami\\n...   0.001189\n",
      "30      31  Marsh, H. W, Parker, P. D & Morin, AJ. (2016)....   0.001947\n",
      "31      32  The King of The South T.I.P returns yet again ...   0.011320\n",
      "32      33  No prices to compare at the moment.\\nWhat is A...   0.026912\n",
      "33      34  Hanover Park police and firefighter/paramedics...   0.001274\n",
      "34      35  Julie Buchanan - Your wedding celebrant\\nI am ...   0.001094\n",
      "35      36  - Open Access\\nNorepinephrine enhances the LPS...   0.174337\n",
      "36      37  Are you an Amazon Echo or Echo dot owner? Woul...   0.006586\n",
      "37      38  Ninth day of my apprenticeship. Went dumpster ...   0.003054\n",
      "38      39  Regret and rue and remorse are all from the pa...   0.102948\n",
      "39      40  Can i we take Immune globulin intravenous rout...   0.004886\n",
      "40      41  Whether you’re single or have a significant ot...   0.049372\n",
      "41      42  Short-lived or infrequent episodes of stress p...   0.001169\n",
      "42      43  Unknown > Shiny Poison\\nNumber in series: Tags...   0.044633\n",
      "43      44  Al Jazira confirm Eric Gerets as new coach to ...   0.000965\n",
      "44      45  Mortgage Refinance Rates. Compare current, cus...   0.176559\n",
      "45      46  Being an independent structural division, the ...   0.003829\n",
      "46      47  After getting the shaft from schedule makers t...   0.000514\n",
      "47      48  gift of love, loyalty, and companionship\\npupp...   0.067827\n",
      "48      49  PULL APART HEART\\nGold Coast indie rockers Eli...   0.002925\n",
      "49      50  Food Technology school trips to Greece\\nStuden...   0.000471\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# create parameters\n",
    "HAP(input_folder=\"test-data/input\",\n",
    "        output_folder=\"output\",\n",
    "        model_name_or_path= 'ibm-granite/granite-guardian-hap-38m',\n",
    "        annotation_column= \"hap_score\",\n",
    "        doc_text_column= \"contents\",\n",
    "        inference_engine= \"CPU\",\n",
    "        max_length= 512,\n",
    "        batch_size= 128,\n",
    "        ).transform()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0bd4ad5c-a1d9-4ea2-abb7-e43571095392",
   "metadata": {},
   "source": [
    "**** The specified folder will include the transformed parquet files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "f21d5d9b-562d-4530-8cea-2de5b63eb1dc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['output/metadata.json', 'output/test1.parquet']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# the outputs will be located in the following folders\n",
    "import glob\n",
    "glob.glob(\"output/*\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2cd3367a-205f-4d33-83fb-106e32173bc0",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
